Depth and Autonomy
A Framework for Evaluating LLM Applications in Social Science Research
—APSA 2025—
Ali Sanaei & Ali Rajabzadeh
Outline
- A framework with two dimensions: depth and autonomy.
- A questionnaire for evaluating LLM use in social science research, applied to published studies.
- Some results suggesting that autonomy may be manageable.
Who wins, mano a máquina?
“The cheese that the mouse that the cat that the dog that the boy that the teacher that the principal that the inspector noted reported warned scolded chased caught ate was moldy.”
Fill in the blanks: ___ had scolded ___.
The Rise of LLMs in Social Science
- Research is expected to be “validated” in some editor-dependent way.
- “Interesting” work is seeping through the filters.
- We seem to even lack the language to talk about how these models are used.
- Validity
- Reliability
- Replicability
Our Proposal: A Guiding Framework
We aim for higher research quality, transparency, and preserved human control.
We want a (two?)-dimensional framework to talk about the use of LLMs in social science research:
Critique, recommend, evaluate, …
The goal is to reap the benefits of LLMs while preserving transparency and reliability.
Characterizing LLM Usage: Many Dimensions
To build a useful framework, we must first understand the many ways LLM usage can vary. Let’s explore several key dimensions.
Dimension: Scope of Analysis
Refers to the unit of analysis on the input side.
- Word/Token: Part-of-speech tagging, named entity recognition.
- Sentence: Sentiment analysis.
- Paragraph/Chunk: Summarizing a specific section or a social media post.
- Document: Classifying an entire article.
- Corpus: Synthesizing themes across multiple documents.
Example: Moving from analyzing sentiment in a single tweet (Sentence) to identifying overarching themes in thousands of interview transcripts (Corpus).
Dimension: Reasoning Load
Indexes whether a task requires simple retrieval or complex, multi-step inference.
- Simple Recall: Extracting a date or name explicitly mentioned in a text.
- Multi-step Reasoning: Applying a complex coding rubric that requires checking multiple conditions before assigning a label.
Example:
- Low Load: “What state contains Albuquerque?”
- High Load: “Name all states that start with the same letter as the state containing Albuquerque, but do not contain Albuquerque.”
Dimension: Task Novelty
How familiar is the task to the model?
- In-training: Resembles tasks seen during training (e.g., summarizing news).
- Novel: A genuinely new problem or a unique combination of concepts.
Dimension: Analytical Logic
Describes whether the analytical categories are fixed beforehand or emerge from the data.
- Deductive (Fully Predefined):
- Applying a fixed, pre-existing codebook to a set of interviews.
- No new codes are allowed.
- Inductive (Fully Emergent):
- Performing open coding on focus group transcripts to generate themes from scratch.
- The categories are an output of the analysis, not an input.
Dimension: Iteration
Captures whether the research pipeline is a single-pass or multi-pass process.
- Single-Pass: The model executes the entire analytical task in one step.
- Example: A single prompt to code an entire interview.
- Multi-Pass (Iterative): The task is decomposed into sequential or parallel steps.
- This allows for human review, refinement, and greater control.
- Example: A multi-stage pipeline that first extracts quotes, then clusters them, and finally synthesizes themes.
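The contrast between the two modes can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: `call_llm` is a hypothetical stub standing in for any chat-completion API, so the control flow can be inspected without network access.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub: a real pipeline would call an LLM API here.
    return f"[model output for: {prompt[:40]}...]"

def single_pass(interview: str) -> str:
    # Single-pass: one prompt does extraction, clustering, and synthesis at once,
    # leaving no intermediate artifacts for a human to inspect.
    return call_llm(f"Code this entire interview for themes:\n{interview}")

def multi_pass(interview: str) -> str:
    # Multi-pass: each step is small enough to review before the next runs.
    quotes = call_llm(f"Extract verbatim quotes about governance:\n{interview}")
    clusters = call_llm(f"Cluster these quotes into candidate themes:\n{quotes}")
    # A human checkpoint could be inserted between any two steps above.
    return call_llm(f"Synthesize the clustered themes into a short memo:\n{clusters}")
```

The multi-pass version exposes `quotes` and `clusters` as auditable intermediate outputs, which is what enables human review and refinement.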
Dimension: Epistemology
Situates the underlying philosophical stance of the research.
- Positivist:
- Assumes an objective reality to be measured.
- Emphasis on quantifiable tasks like content analysis with a predefined codebook, where replicability is key.
- Interpretivist:
- Assumes reality is socially constructed and subjective.
- Emphasis on exploring potential meanings, surfacing ambiguities, and generating initial interpretations for deeper human analysis.
Why These Two Dimensions?
Interpretive Depth
- maps onto the spectrum of qualitative methodologies, from descriptive content analysis to deep hermeneutics.
- is an intrinsic feature of the research question itself.
Realized Autonomy
- is the most consequential for evaluating the reliability and safety of the analysis.
- is a feature of the execution: the pipeline and workflow choices made by the researcher.
Focusing on Depth and Autonomy
Interpretive Depth
- The kind of inference the model is asked to perform.
- Ranges from surface-level extraction to deep hermeneutic analysis.
- Set by the research question.
Realized Autonomy
- The extent to which consequential choices are made by the model.
- Ranges from a simple tool to a delegated trustee.
- Set by the research pipeline.
The Autonomy-Depth Plane
A visual guide for designing and evaluating LLM applications.
- Low-autonomy configurations are safer, even for high-depth tasks.
- As tasks require deeper interpretation, the temptation to grant more autonomy increases.
- The top-right quadrant represents a High-Risk Zone, where high model autonomy is combined with deep, nuanced interpretation.
The Bounded-Autonomy Principle
Treat LLMs as capable but fallible research assistants, not as oracles.
Apply everything we know to be helpful:
- Decompose complex tasks into manageable, auditable steps.
- Provide clear rubrics, worked examples, and structured outputs.
- Require citations and direct textual evidence for all claims.
- Reserve critical interpretive decisions and conflict resolution for the human researcher.
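One of these practices, requiring direct textual evidence for all claims, can be enforced mechanically. Below is a minimal sketch of such an audit step; the claim format (`code` plus `evidence` fields) and the example letter text are illustrative assumptions, not taken from the paper.

```python
def audit_claims(source_text: str, claims: list[dict]) -> list[dict]:
    """Keep only claims whose `evidence` is a verbatim substring of the source."""
    return [c for c in claims if c.get("evidence", "") and c["evidence"] in source_text]

# Illustrative source and model output (not quotes from the actual letter).
letter = "Appoint judges who are patient and do not fly into a rage."
claims = [
    {"code": "judicial temperament", "evidence": "do not fly into a rage"},
    {"code": "bicameralism", "evidence": "two chambers shall deliberate"},  # fabricated
]
kept = audit_claims(letter, claims)  # only the first claim survives the audit
```

A fabricated claim fails the substring check automatically; the human researcher then decides what to do with the rejects, rather than delegating that judgment to the model.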
Surveying the Field
How are LLMs actually being used in published social science research?
We systematically coded 56 published articles to map the current state of the literature onto our framework.
Survey Findings: A Diverse Landscape
Our analysis of published papers reveals:
- Wide variation in how researchers use LLMs.
- Studies are scattered across the Autonomy-Depth plane.
- No clear correlation yet between depth and autonomy.
- Transparency and evaluation practices are highly heterogeneous.
This variation highlights the need for a common framework.
Why Letter 53?
- Early governance text (657 CE): theology + administration
- Challenging content: long, dense, translation debates
- Good stress test for depth and value-grounding
Experiment 1: The Abstention Test
Objective: Assess whether an LLM will fabricate answers for an impossible task.
Design:
- Task: Find evidence of “bicameralism” in a 7th-century letter (an anachronistic and conceptually mismatched query).
- Conditions (2×2):
- Output range: constrain to 1-10 vs. 0-10 items.
- With or without an explicit abstention option (“There is no evidence for that!”).
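The two manipulated conditions amount to small prompt variations. A minimal sketch of how such prompts could be constructed, under the assumption that the conditions were implemented as instruction text (the exact wording used in the experiment is not reproduced here):

```python
def build_prompt(text: str, min_items: int, allow_abstain: bool) -> str:
    # Output-range condition: min_items=0 already permits an empty list.
    instr = f"List {min_items}-10 elements of bicameralism in the letter below."
    if allow_abstain:
        # Abstention condition: an explicit exit path the model may take
        # instead of producing items.
        instr += ' If there is none, reply exactly: "There is no evidence for that!"'
    return f"{instr}\n\n{text}"
```

The point of the design is that only the `allow_abstain` branch gives the model a legitimate way to satisfy the prompt without fabricating evidence.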
Exp 1: Results
An explicit “exit path” is critical to prevent fabrication.
| Output range | Abstention option | Mean items returned |
| --- | --- | --- |
| 1-10 elements | No | 7.36 |
| 1-10 elements | Yes | 0.16 |
| 0-10 elements | No | 5.26 |
| 0-10 elements | Yes | 0.00 |
Takeaway: Without a way to abstain, the model will hallucinate to satisfy the prompt’s constraints.
Experiment 2: The Power of Decomposition
Objective: Assess whether task decomposition can reduce autonomy and improve output quality.
Design:
- Task: Extract elements of “constitutionalism” from the same 7th-century letter.
- Three Methods:
- Baseline: A single, complex prompt.
- Two-Stage: 1) Propose a coding schema, 2) Apply it.
- Multi-Stage: 1) Propose schema, 2) Apply to dimensions in parallel, 3) Synthesize.
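The three designs differ only in how the work is split across calls. A minimal sketch under stated assumptions: `call_llm` is a hypothetical stub, and the dimension names passed to `multi_stage` are illustrative placeholders, not the schema the model actually proposed.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real LLM API call.
    return f"<answer to: {prompt.splitlines()[0]}>"

def baseline(letter: str) -> str:
    # One complex prompt: schema choice and application happen inside the model.
    return call_llm("Extract all elements of constitutionalism.\n" + letter)

def two_stage(letter: str) -> str:
    # Stage 1 proposes a schema; stage 2 applies it. The schema is now visible.
    schema = call_llm("Propose a coding schema for constitutionalism.\n" + letter)
    return call_llm(f"Apply this schema: {schema}\n" + letter)

def multi_stage(letter: str, dimensions: list[str]) -> str:
    # Each dimension is coded independently (auditable, parallelizable), then merged.
    codings = [call_llm(f"Code the dimension '{d}'.\n" + letter) for d in dimensions]
    return call_llm("Synthesize these codings into one report.\n" + "\n".join(codings))
```

Each added stage surfaces an intermediate artifact (the schema, the per-dimension codings) that the baseline keeps hidden, which is what makes the multi-stage result auditable.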
Exp 2: Results
Decomposition yields more detailed, stable, and auditable results.
| Dimension | Baseline | Two-Stage | Multi-Stage |
| --- | --- | --- | --- |
| Legal limits on rulers’ powers | ✓ | 9 | 9 |
| Supremacy of constitutional norms | ✓ | 8 | 9 |
| Procedural limits | ✓ | 9 | 9 |
| Amendment rules | ✗ | 2 | 0 |
| Consent in lawmaking | ✗ | 3 | 2 |
- All methods reached a similar high-level conclusion.
- However, the Baseline was a “black box” with limited detail.
- Multi-Stage provided the richest, most reliable, and fully auditable analysis.
Practical Takeaway
Break the task,
bind the output,
and climb the ladder of abstraction under human gaze.
Conclusion
- The Depth-Autonomy framework offers a structured way to design and evaluate LLM applications in social science.
- Constraining autonomy through task decomposition is the key to achieving reliable results for high-depth interpretive tasks.
- Explicit abstention options are crucial for preventing model fabrication and ensuring research integrity.
- Multi-stage pipelines produce more detailed, stable, and auditable outputs than single-pass approaches.